1 Morning

REFRESHER: Using R Studio

TIPS

  • Use Tab to auto complete
  • Use up arrow to get previous command

1.0.1 Create a new project in R Studio

  1. From File menu select New Project…
  2. From New Project Dialog select New Directory
  3. Select New Project as project type.
  4. Give a project name, i.e. “Workshop” and click Create Project button.

1.0.2 Create a new script.

  1. From the File menu select New File > R Script
  2. Save file as “my_Rscript.R”

Write commands in code editor of R Studio and run them using icon Run in R Studio.

1.1 Prepare counts and metadata tables from RNA-seq results files

We will download files containing gene counts per sample for an example RNA-seq project, and combine them into a single counts table. We’ll also prepare a metadata file for this project.

First, download the table of zip counts at this URL:

https://wd.cri.uic.edu/edgeR/featurecounts.zip

This file will be, by default, saved to your Downloads folder. Use your system’s file explorer to navigate to this folder, or enter the following into its address bar:

  • Windows: %USERPROFILE%/Downloads
  • Mac or Linux: ~/Downloads

Once you find the featurecounts.zip file, you will need to unzip it.

  • Windows: Right click on the file and click “Extract all…”, then click the “Extract” button to extract the zip file to the same location as the zip file.
  • Mac: Double click on the file to extract the zip file to the same location as the zip file.

There should be 6 files extracted which were generated by the featureCounts program. Each file has gene expression counts for one sample. Here is a preview of the first one (note that long fields are truncated):

# Program:featureCounts v2.0.3; Command:"featureCounts" "-t" "exon" "-g" "gene_id" "-p"
    "-T" "1" "-s" "1" "-a" "mm9.gtf" "-o" "A.1" "A.1.bam"
Geneid   Chr          Start        End          Strand       Length  A.1.bam
Xkr4     chr1;chr...  3204563;...  3207049;...  -;-;-        3634    0
Rp1      chr1;chr...  4280927;...  4283093;...  -;-;-;-;...  9747    7
Sox17    chr1;chr...  4481009;...  4482749;...  -;-;-;-;...  3130    229
Mrpl15   chr1;chr...  4763279;...  4764597;...  -;-;-;-;...  4203    80
Lypla1   chr1;chr...  4797974;...  4798063;...  +;+;+;+;...  2433    184
Tcea1    chr1;chr...  4847775;...  4848057;...  +;+;+;+;...  2847    108
Rgs20    chr1;chr...  4899657;...  4900743;...  -;-;-;-;...  2241    0
Atp6v1h  chr1;chr...  5073254;...  5073359;...  +;+;+;+;...  1976    233

There’s a comment line (first line starting with #) giving the command used to generate this file. After that the fields in the table give the gene ID, all exon coordinates and strands (Chr, Start, End, and Strand), total gene length in bp, and read counts for sample A.1. The read counts (last column) is the data we want for each sample, along with the gene IDs.

What we need to do is:

  1. Make a list of files to combine
  2. Read each file into R
  3. Create a table of the gene ID and gene counts for each sample
  4. Write this to a new file
  5. Prepare a metadata file

The steps below assume that the genes are in the same order in each file (we use cbind() without checking that the gene names match). In practice this is true for featureCounts files, but it is also something that is good to double-check in general.

1.1.1 Make a list of all of the files ending in “counts.txt”.

This assumes you’ve downloaded all of the files to your “Downloads” folder and unpacked them there.

On Windows:

fc.files <- list.files("~/../Downloads/featurecounts",
   pattern="*.counts.txt", full.names=T)
fc.files
[1] "C:\\Users\\jasonmw\\Documents/../Downloads/featurecounts/A.1.counts.txt"
[2] "C:\\Users\\jasonmw\\Documents/../Downloads/featurecounts/A.2.counts.txt"
[3] "C:\\Users\\jasonmw\\Documents/../Downloads/featurecounts/A.3.counts.txt"
[4] "C:\\Users\\jasonmw\\Documents/../Downloads/featurecounts/B.1.counts.txt"
[5] "C:\\Users\\jasonmw\\Documents/../Downloads/featurecounts/B.2.counts.txt"
[6] "C:\\Users\\jasonmw\\Documents/../Downloads/featurecounts/B.3.counts.txt"

On Mac or Linux:

fc.files <- list.files("~/Downloads/featurecounts",
   pattern="*.counts.txt", full.names=T)
fc.files
[1] "/home/jasonmw/Downloads/featurecounts/A.1.counts.txt"
[2] "/home/jasonmw/Downloads/featurecounts/A.2.counts.txt"
[3] "/home/jasonmw/Downloads/featurecounts/A.3.counts.txt"
[4] "/home/jasonmw/Downloads/featurecounts/B.1.counts.txt"
[5] "/home/jasonmw/Downloads/featurecounts/B.2.counts.txt"
[6] "/home/jasonmw/Downloads/featurecounts/B.3.counts.txt"

Note that the file paths will reflect where the files are on YOUR computer.

1.1.2 Read each file into R

We will use the lapply() function to have the read.delim() function read each of these files into R and return a list containing the data. We need to tell the read.delim() function to ignore comment lines (starts with #). We also need to subset the data frame read in from each file to just column 6 (gene counts). Note that this was column 7, but we set the first column as row names so the numbers shifted. Finally, we need to set drop=F to keep it structured as a data frame, otherwise it would become a vector and we would loose the row names.

fc.data <- lapply(fc.files,
   function(x) { read.delim(x, row.names=1, comment="#")[, 6, drop=F] })

1.1.3 Create a table of the gene ID and gene counts for each sample

Now that we have all of the data loaded into a list, we will use the Reduce() function combined with the cbind() function to combine the data into a single data frame. The Reduce() function will iteratively apply a function to a list (note that we could have done this with a for loop instead). We can use cbind() because we know that all of the data tables loaded have the same rows.

counts.table <- Reduce(cbind, fc.data)
head(counts.table, n=3)

We need to remove the “.bam” suffixes from the column names using the gsub() function (global substitution). Note that the $ symbol means: end of the line.

colnames(counts.table) <- gsub(".bam$", "", colnames(counts.table))
head(counts.table, n=3)

1.1.4 Write this to a new file

To get the best output, we will copy the rownames to a new column and save the table as a tab separated file.

counts.table <- cbind(Gene=rownames(counts.table), counts.table)
write.table(counts.table, "combined_counts.txt", sep="\t", row.names=F, quote=F)

You now have a counts table.

1.1.5 Prepare a metadata file

To make the metadata file, open up Microsoft Excel and create a file as follows:

Then choose Save As > Tab delimited Text (.txt) and save the file to a location you can find. You now have a metadata table.

1.2 Pairwise differential analysis in edgeR

Start a new R script and save it to your computer as “edgeR.R”. Execute the edgeR commands one-at-a-time.

ABOUT: These data are from an RNA-seq experiment with two experimental groups, A and B. There are three replicates per group.

1.2.1 Load the edgeR library

library(edgeR)

1.2.2 Load the read counts and metadata tables

data <- read.delim("https://wd.cri.uic.edu/edgeR/pairwise_counts.txt", row.names=1)
dim(data)
## [1] 23420     6
head(data, n=3)
metadata <- read.delim("https://wd.cri.uic.edu/edgeR/pairwise_metadata.txt", row.names=1)
metadata

1.2.3 Verify the metadata row order matches the data column order

Check the data again to make sure we didn’t drop any columns!

data <- data[, rownames(metadata)]
dim(data)
## [1] 23420     6
head(data, n=3)

1.2.4 Subset counts to genes with >20 total counts

data_subset <- data[rowSums(data) > 20,]
dim(data_subset)
## [1] 14193     6

1.2.5 Create edgeR object

genes <- DGEList(counts=data_subset, group=metadata[, 1])
genes
## An object of class "DGEList"
## $counts
##        A.1 A.2 A.3 B.1 B.2 B.3
## Rp1      7  18  28  18   8   9
## Sox17  229 451 234 367 456 436
## Mrpl15  80 130 104 157 231 207
## Lypla1 184 215 189 292 344 377
## Tcea1  108 192 148 175 208 219
## 14188 more rows ...
## 
## $samples
##     group lib.size norm.factors
## A.1     A  3326193            1
## A.2     A  5032893            1
## A.3     A  5352456            1
## B.1     B  4979670            1
## B.2     B  5620129            1
## B.3     B  5886895            1

1.2.6 Calculate TMM factors

genes <- calcNormFactors(genes)
genes
## An object of class "DGEList"
## $counts
##        A.1 A.2 A.3 B.1 B.2 B.3
## Rp1      7  18  28  18   8   9
## Sox17  229 451 234 367 456 436
## Mrpl15  80 130 104 157 231 207
## Lypla1 184 215 189 292 344 377
## Tcea1  108 192 148 175 208 219
## 14188 more rows ...
## 
## $samples
##     group lib.size norm.factors
## A.1     A  3326193    1.0236147
## A.2     A  5032893    1.0200193
## A.3     A  5352456    0.8468997
## B.1     B  4979670    1.1083495
## B.2     B  5620129    0.9269172
## B.3     B  5886895    1.1007924

1.2.7 Estimate dispersion

genes <- estimateDisp(genes)
genes
## An object of class "DGEList"
## $counts
##        A.1 A.2 A.3 B.1 B.2 B.3
## Rp1      7  18  28  18   8   9
## Sox17  229 451 234 367 456 436
## Mrpl15  80 130 104 157 231 207
## Lypla1 184 215 189 292 344 377
## Tcea1  108 192 148 175 208 219
## 14188 more rows ...
## 
## $samples
##     group lib.size norm.factors
## A.1     A  3326193    1.0236147
## A.2     A  5032893    1.0200193
## A.3     A  5352456    0.8468997
## B.1     B  4979670    1.1083495
## B.2     B  5620129    0.9269172
## B.3     B  5886895    1.1007924
## 
## $common.dispersion
## [1] 0.07726973
## 
## $trended.dispersion
## [1] 0.15541353 0.05724855 0.06490358 0.05832588 0.06245742
## 14188 more elements ...
## 
## $tagwise.dispersion
## [1] 0.17099271 0.05072614 0.04405357 0.03575971 0.03376216
## 14188 more elements ...
## 
## $AveLogCPM
## [1] 1.746894 6.165060 4.900612 5.724314 5.127011
## 14188 more elements ...
## 
## $trend.method
## [1] "locfit"
## 
## $prior.df
## [1] 4.134752
## 
## $prior.n
## [1] 1.033688
## 
## $span
## [1] 0.2945153

1.2.8 Calculate Biological Coefficient of Variation (BCV)

sqrt(genes$common.dispersion)
## [1] 0.2779743
plotBCV(genes)

1.2.9 Run differential analysis

NOTE: logFC will be computed as B/A: positive means higher in B edgeR calculates this as the second group over the first group from exactTest

stats <- exactTest(genes)
stats
## An object of class "DGEExact"
## $table
##              logFC   logCPM     PValue
## Rp1    -0.93760389 1.746894 0.12038547
## Sox17   0.09658678 6.165060 0.72540551
## Mrpl15  0.54242240 4.900612 0.04323394
## Lypla1  0.36738965 5.724314 0.12314337
## Tcea1   0.04750527 5.127011 0.84813699
## 14188 more rows ...
## 
## $comparison
## [1] "A" "B"
## 
## $genes
## NULL

1.2.10 Run False Discovery Rate (FDR) correction

We will utilize the Benjamini-Hochberg (BH) method

stats$table$QValue <- p.adjust(stats$table$PValue, method="BH")
head(stats$table, n=3)

Check the number of significant genes (QValue < 0.05)

sum(stats$table$QValue < 0.05)
## [1] 684

1.2.11 Calculate normalized expression

We will utilize the cpm() function (Counts per Million) to normalize the expression data.

norm <- cpm(genes)
head(norm, n=3)
##              A.1       A.2       A.3       B.1       B.2       B.3
## Rp1     2.055957  3.506278  6.176934  3.261333  1.535687  1.388835
## Sox17  67.259172 87.851756 51.621518 66.494965 87.534172 67.281361
## Mrpl15 23.496654 25.323122 22.942897 28.446075 44.342969 31.943215

1.3 PCA plot

You can use the normalized expression data from the previous exercise. If you are behind or had issues on the previous exercise, the following will load the needed data for this exercise.

norm <- read.delim("https://wd.cri.uic.edu/edgeR/pairwise_norm.txt", row.names=1)
metadata <- read.delim("https://wd.cri.uic.edu/edgeR/pairwise_metadata.txt", row.names=1)

1.3.1 Log-scale the data first

This is good practice, as gene expression varies over several orders of magnitude. We need to add a pseudo-count (0.1) to avoid taking the log of 0

norm.log <- log2(norm + 0.1)

1.3.2 Run Principal Component Analysis (PCA)

We need to transpose the log-scaled normalized data using the t() function. We want to plot the SAMPLES, but the data table is currently setup for GENES

pca <- prcomp(t(norm.log))

1.3.3 Examine the summary of the PCA

We need to check the proportion of variance for each principal component (PC).

summary(pca)
## Importance of components:
##                            PC1     PC2     PC3      PC4      PC5       PC6
## Standard deviation     58.6293 31.5442 27.0428 24.24291 19.72479 7.086e-14
## Proportion of Variance  0.5598  0.1620  0.1191  0.09571  0.06336 0.000e+00
## Cumulative Proportion   0.5598  0.7218  0.8409  0.93664  1.00000 1.000e+00

1.3.4 Create labels for the X and Y axes

We will extract the proportion of variance for each PC from the summary() function. We will use the sprintf() function to create labels for the X and Y axis. %.2f is a formatting code to print a floating point number with 2 decimal places. %% is the formatting code to display a percent symbol.

importance <- summary(pca)$importance[2,]
xlabel <- sprintf("PC1 (%.2f%%)", 100 * importance[1])
ylabel <- sprintf("PC2 (%.2f%%)", 100 * importance[2])

1.3.5 Format the PCA data for use with ggplot2

We will use cbind() to merge the principal components data with the metadata for each sample.

pca.data <- cbind(pca$x, metadata, Sample=rownames(metadata))
head(pca.data, n=3)

1.3.6 Create a PCA plot using ggplot2

We need to load the ggplot2 library. We will use the geom_point() method to display the PCA data, and we will use geom_text() to add sample labels; hjust=0 and vjust=0 makes it left-justified. The coord_cartesian() method with clip='off' set allows the labels go outside the plot boundary.

library(ggplot2)
ggplot(pca.data, aes(x=PC1, y=PC2, label=Sample, color=Group)) +
  geom_point() + geom_text(hjust=0, vjust=0, show.legend=F, color="black") +
  coord_cartesian(clip='off') + labs(x=xlabel, y=ylabel)

1.3.7 Create a Screeplot using ggplot2 (optional)

We need to first format the variance data for use with ggplot2 using the data.frame() method.

importance.df <- data.frame(PC=names(importance), Variance=importance)
ggplot(importance.df, aes(x=PC, y=Variance)) + geom_col() +
  labs(x="Principal Component", y="Variance Explained")

1.4 Heatmap

We will make a heatmap of z-scored log2 CPM values for all differentially expressed genes (FDR < 0.05). We will specify our own color scheme and limit the ranges of z-scores that are plotted.

You can use the stats and norm.log variables we created earlier. If you are behind or had issues on that exercise, the following will load the needed data for this exercise.

stats <- readRDS(url("https://wd.cri.uic.edu/edgeR/stats.rds"))
norm.log <- read.delim("https://wd.cri.uic.edu/edgeR/pairwise_lognorm.txt", row.names=1)

1.4.1 Load the ComplexHeatmap and circlize libraries

We will use ComplexHeatmap to create the heatmap figure and we will use circlize to generate colors.

library(ComplexHeatmap)
library(circlize)

1.4.2 Get the log-scaled CPMs of differentially expressed genes

We will use the more stringent Q-value that we calculated earlier. Note that the norm and stats table will be in the same row order, so this will work.

degs.norm <- norm.log[stats$table$QValue < 0.05,]
dim(degs.norm)
## [1] 684   6

1.4.3 Z-score across the genes

This will account for gene-to-gene differences in baseline expression. We need to transpose twice (t() function), as the scale() function works on columns

degs.norm.z <- t(scale(t(degs.norm)))

1.4.4 Setup the colors we will use

This will scale the colors from blue to white to red, with max/min at +/-2 and the middle at 0.

col_fun <- colorRamp2(c(-2, 0, 2), c("blue", "white", "red"))

1.4.5 Create the heatmap

We will create a basic heatmap, turning off the row names and using the row_split and column_split options to add some space between clusters in the data. The draw() function finalizes the image created by the Heatmap() function and allows us to save the image to variable ht while also plotting the figure to the display device; this is the ComplexHeatmap way of doing things.

ht <- draw(Heatmap(degs.norm.z, show_row_names=F, name="Z-scored log2 CPM",
  col=col_fun, row_split=2, column_split=2))

1.4.6 Get the order of genes in the heatmap plot

Notice that sample B.2 is a little bit “different” from the other B samples, with more extreme colors. Sample B.2 is also the more “different” one in the PCA plot. Overall, B.2 is more different from the group A samples than the other B samples, which we can see in both the PCA plot and the heatmap. How can we get the genes that correspond to the heatmap?

genes_order <- row_order(ht)

1.4.6.1 Get gene names for the first cluster

Note that genes_order is a list of 2 vectors. Each vector is the order (indices) of the genes for each row cluster. To get the actual gene names for each cluster, we would need to select the names from the degs.norm.z based on the indices in the vectors.

cluster1_genes <- rownames(degs.norm.z)[ genes_order[[1]] ]
head(cluster1_genes, n=3)
## [1] "Eif2s3y" "Fbn2"    "Rbm20"

1.4.6.2 Use an apply statement to get the genes from all clusters

Note that gene_clusters will also be a list.

gene_clusters <- sapply(genes_order, function(x) rownames(degs.norm.z)[x])
head(gene_clusters[[1]], n=3)
## [1] "Eif2s3y" "Fbn2"    "Rbm20"
head(gene_clusters[[2]], n=3)
## [1] "Ptafr"  "Fkbp10" "Il1b"

1.4.7 Create a tall heatmap with gene names displayed (optional)

You can save a really tall heatmap (with gene names turned ON) to a PDF to double-check the gene orders.

pdf("pairwise_long_heatmap.pdf", height=100)
Heatmap(degs.norm.z, show_row_names=T, name="Z-scored log2 CPM",
  col=col_fun, row_split=2, column_split=2)
dev.off()

1.5 One-way ANOVA

We will read in the data, set up our model, and estimate the dispersion similarly to what we did before. Then we will use glmQLFit() to fit the model, and glmQLFTest() to test the model

ABOUT: These data are from an RNA-seq experiment comparing two different disease models (model1 and model2) to a healthy control group, so there are 3 experimental groups in total. There are 4 replicates per group.

1.5.1 Load the edgeR library

It is safe to skip this step if you have already load the library in Rstudio.

library(edgeR)

1.5.2 Read in counts table

data <- read.delim("https://wd.cri.uic.edu/edgeR/anova_counts.txt", row.names=1)
dim(data)
## [1] 24421    12
head(data, n=3)

1.5.3 Read in metadata table

metadata <- read.delim("https://wd.cri.uic.edu/edgeR/anova_metadata.txt", row.names=1)
head(metadata, n=3)

1.5.4 Make sure the metadata row order matches the data column order

We will check the data again to make sure we didn’t drop any columns!

data <- data[, rownames(metadata)]
dim(data)
## [1] 24421    12

1.5.5 Subset counts to genes with >20 total counts

data_subset <- data[rowSums(data) > 20,]
dim(data_subset)
## [1] 16713    12

1.5.6 Define our factors

treat <- factor(metadata[, 1])
treat
##  [1] Control Control Control Control Model1  Model1  Model1  Model1  Model2 
## [10] Model2  Model2  Model2 
## Levels: Control Model1 Model2

1.5.7 Define our model matrix

model <- model.matrix(~treat)
model
##    (Intercept) treatModel1 treatModel2
## 1            1           0           0
## 2            1           0           0
## 3            1           0           0
## 4            1           0           0
## 5            1           1           0
## 6            1           1           0
## 7            1           1           0
## 8            1           1           0
## 9            1           0           1
## 10           1           0           1
## 11           1           0           1
## 12           1           0           1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$treat
## [1] "contr.treatment"

1.5.8 Create edgeR object

We will create the edgeR object without setting the groups this time.

genes <- DGEList(counts=data_subset)

1.5.9 Calculate TMM factors

genes <- calcNormFactors(genes)

1.5.10 Estimate dispersion

genes <- estimateDisp(genes, model)

1.5.11 Calculate Biological Coefficient of Variation (BCV)

sqrt(genes$common.dispersion)
## [1] 0.1719392
plotBCV(genes)

1.5.12 Fit the model

fit <- glmQLFit(genes, model)

1.5.13 Test if either coefficient 2 or 3 has an effect

qlf <- glmQLFTest(fit, coef=2:3)

1.5.14 Run False Discovery Rate (FDR) correction

qlf$table$QValue <- p.adjust(qlf$table$PValue, method="BH")
head(qlf$table, n=3)

Let’s check the number of significant genes

sum(qlf$table$QValue < 0.05)
## [1] 5710

1.5.15 Normalize expression

We will have the cpm() function directly generate log-scaled gene expression this time rather than doing it ourselves.

norm <- cpm(genes, log=T)

1.5.16 Subset the norm table based on differentially expressed genes

We will use the more stringent Q-value that we calculated earlier. Note that norm and qlf$table tables will be in the same row order, so this will work.

degs.norm <- norm[qlf$table$QValue < 0.05,]

1.5.17 Z-score across the genes

Note that we are using t() to transpose twice because the scale() function works on columns and we need to z-score on rows

degs.norm.z <- t(scale(t(degs.norm)))

1.5.18 Create a heatmap

Heatmap(degs.norm.z, show_row_names=F, name="One-way ANOVA, Z-scored log2 CPM",
        col=col_fun)

1.6 Contrasts

Let’s try a contrast where we compare the control group to the average of both treatments. We’ll remake the model matrix without an intercept for this, but using the same data as before.

1.6.1 Create the model

This time we will make the model without the intercept (add 0 into the formula).

model.alt <- model.matrix(~0 + treat)
model.alt
##    treatControl treatModel1 treatModel2
## 1             1           0           0
## 2             1           0           0
## 3             1           0           0
## 4             1           0           0
## 5             0           1           0
## 6             0           1           0
## 7             0           1           0
## 8             0           1           0
## 9             0           0           1
## 10            0           0           1
## 11            0           0           1
## 12            0           0           1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$treat
## [1] "contr.treatment"

1.6.2 Create a new edgeR object

genes.alt <- DGEList(counts=data_subset)

1.6.3 calculate TMM factors

genes.alt <- calcNormFactors(genes.alt)

1.6.4 Estimate dispersion

genes.alt <- estimateDisp(genes.alt, model.alt)

1.6.5 Calculate BCV from the common dispersion

This should be the same as last time: these models are ultimately equivalent in terms of how they model the data, but differ in how easily we can ask different questions of them

sqrt(genes.alt$common.dispersion)
## [1] 0.1719392

1.6.6 Fit the model

fit.alt <- glmQLFit(genes.alt, model.alt)

1.6.7 Try some contrasts

Let’s try a pair-wise comparison between model1 and model2.

comp1 <- makeContrasts(treatModel1 - treatModel2, levels=model.alt)

Let’s try a comparison of control vs average of model1 and model2.

comp2 <- makeContrasts(treatControl - (treatModel1 + treatModel2)/2, levels=model.alt)

1.6.8 Run qlf tests for these comparisons

qlf1 <- glmQLFTest(fit.alt, contrast=comp1)
qlf2 <- glmQLFTest(fit.alt, contrast=comp2)

1.6.9 Do B-H p-value adjustment for each

qlf1$table$QValue <- p.adjust(qlf1$table$PValue, method="BH")
qlf2$table$QValue <- p.adjust(qlf2$table$PValue, method="BH")

1.6.10 Get the normalized data for for the DEGs

We can use the same normalized expression that we got above, already in log-scale subset based on degs

degs.norm1 <- norm[qlf1$table$QValue < 0.05,]
degs.norm2 <- norm[qlf2$table$QValue < 0.05,]

How many genes are significant in both cases?

length(intersect(rownames(degs.norm1), rownames(degs.norm2)))
## [1] 1914

1.6.11 Z-score across the genes

degs.norm.z1 <- t(scale(t(degs.norm1)))
degs.norm.z2 <- t(scale(t(degs.norm2)))

1.6.12 Create the heatmaps

Heatmap 1: these are genes where model1 and model2 are different

Heatmap(degs.norm.z1, show_row_names=F, name="Contrast 1, Z-scored log2 CPM",
        col=col_fun)

Heatmap 2: these are genes where the average of model1 and model2 is dfferent from control

Heatmap(degs.norm.z2, show_row_names=F, name="Contrast 2, Z-scored log2 CPM",
        col=col_fun)

2 Afternoon

2.1 Two-way ANOVA

Set up and run a two-way ANOVA-type model with an interaction term.

ABOUT: These are data from a mouse RNA-seq experiment where animals were subjected to a disease model and allowed to recover; we have disease conditions: control (no disease), peak disease, and recovery. Two different nerve tissues were collected to profile gene expression: the dorsal root ganglion (DRG) and the sciatic nerve. The goal is to look for similarities and differences in the tissue responses to the disease model.

2.1.1 Load the edgeR library

This step is safe to skip if you have not restarted your R session.

library(edgeR)

2.1.2 Read in counts table

data <- read.delim("https://wd.cri.uic.edu/edgeR/twofactor_counts.txt", row.names=1)
dim(data)
## [1] 39056    18
head(data, n=3)

2.1.3 Read in metadata table

metadata <- read.delim("https://wd.cri.uic.edu/edgeR/twofactor_metadata.txt", row.names=1)
head(metadata, n=3)

2.1.4 Make sure the metadata row order matches the data column order

Check the data dimensions again to make sure we didn’t drop any columns!

data <- data[, rownames(metadata)]
dim(data)
## [1] 39056    18

2.1.5 Subset counts to genes with >20 total counts

data_subset <- data[rowSums(data) > 20,]
dim(data_subset)
## [1] 20564    18

2.1.6 Define our factors and model matrix

disease <- factor(metadata[, 1])
tissue <- factor(metadata[, 2])
model <- model.matrix(~ disease + tissue + disease:tissue)

2.1.7 Create edgeR object

genes <- DGEList(counts=data_subset)

2.1.8 Calculate TMM factors

genes <- calcNormFactors(genes)

2.1.9 Estimate dispersion

genes <- estimateDisp(genes, model)

2.1.10 Calculate BCV from the common dispersion

sqrt(genes$common.dispersion)
## [1] 0.1542001
plotBCV(genes)

2.1.11 Fit the model

fit <- glmQLFit(genes, model)

2.1.12 Test each factor in turn

Keep in mind which coefficients these correspond to.

qlf.disease <- glmQLFTest(fit, coef=2:3)
qlf.tissue <- glmQLFTest(fit, coef=4)
qlf.inter <- glmQLFTest(fit, coef=5:6)

2.1.13 Do B-H p-value adjustment

qlf.disease$table$QValue <- p.adjust(qlf.disease$table$PValue, method="BH")
qlf.tissue$table$QValue <- p.adjust(qlf.tissue$table$PValue, method="BH")
qlf.inter$table$QValue <- p.adjust(qlf.inter$table$PValue, method="BH")

Let’s check the number of significant genes.

sum(qlf.disease$table$QValue < 0.05)
## [1] 758
sum(qlf.tissue$table$QValue < 0.05)
## [1] 13497
sum(qlf.inter$table$QValue < 0.05)
## [1] 2102

2.2 Filtering genes from two-way ANOVA

We want to find genes affected by the disease, but then evaluate them separately depending on whether they behaved differently or similarly between the tissues. We will use the interaction term to sort based on different vs similar effect.

Let’s first make a PCA plot with the tissue as plot shape and disease as color to help us visualize the effects of tissue and disease.

2.2.1 Load libraries

Note you can skip this step if you already have these libraries loaded from earlier in the day.

library(ggplot2)
library(ComplexHeatmap)
library(circlize)

2.2.2 Generate log-scaled normalized expression

norm <- cpm(genes, log=T)

2.2.3 Run PCA

pca <- prcomp(t(norm))

2.2.4 Setup dataframe for ggplot

pca.data <- cbind(pca$x, metadata, Sample=rownames(metadata))

2.2.5 Create labels for the X and Y axes

importance <- summary(pca)$importance[2,]
xlabel <- sprintf("PC1 (%.2f%%)", 100 * importance[1])
ylabel <- sprintf("PC2 (%.2f%%)", 100 * importance[2])

2.2.6 Create combined PCA plot

We will plot tissue as plot shape and disease as color. Note that we turn “clipping” off by using the coord_cartesian option; this allows the text labels to flow outside of the plot area.

ggplot(pca.data, aes(x=PC1, y=PC2, label=Sample, color=Disease, shape=Tissue)) + 
  geom_point() + geom_text(hjust=0, vjust=0, show.legend=F, color="black") +
  coord_cartesian(clip='off') + labs(x=xlabel, y=ylabel)

2.2.7 Create a Screeplot (optional)

Since these PCA results have 18 PC, the default alphanumeric sort that is used will give us an odd looking plot. The X axis would go from PC1 to PC10 and PC11 instead of PC2 and PC3. To fix this, we will use the factor() function with the levels argument to order the PC as we would expect.

importance.pc <- factor(names(importance), levels=names(importance))
importance.df <- data.frame(PC=importance.pc, Variance=importance)
ggplot(importance.df, aes(x=PC, y=Variance)) + geom_col() +
labs(x="Principal Component", y="Variance Explained")

Above, we can see that the strongest effect (along PC1) is due to tissue. This matches the number of DEGs observed, and would be expected as tissues will generally have very different transcriptomic profiles.

We will determine a filtering strategy using the two-way ANOVA statistics, and make heatmaps from each option.

2.2.8 Z-score the normalized expression

norm.z <- t(scale(t(norm)))

2.2.9 Make a Boolean selection vector for effects due to disease or interaction term

disease.sel <- qlf.disease$table$QValue < 0.05
inter.sel <- qlf.inter$table$QValue < 0.05

2.2.10 Find genes with tissue-specific effects

Genes are significant for disease AND interaction.

different_effect <- norm.z[disease.sel & inter.sel,]
nrow(different_effect)
## [1] 310

2.2.11 Find genes with disease effect, but no interaction

Genes are significant for disease but NOT interaction.

shared_effect <- norm.z[disease.sel & !inter.sel,]
nrow(shared_effect)
## [1] 448

2.2.12 Prepare color scales and heatmap annotations

We will use the same color scale as before, with colors from blue to white to red, with max/min at +/-2 and the middle at 0.

col_fun <- colorRamp2(c(-2, 0, 2), c("blue", "white", "red"))

We will use the HeatmapAnnotation() method from ComplexHeatmap to add colored bars above the heatmap to show disease and tissue metadata.

col_labels <- HeatmapAnnotation(df=metadata)

2.2.13 Make heatmaps of each of these

We will utilize the column_split parameter to make side-by-side heatmaps for the tissues.

Heatmap(different_effect, column_title="Different effect", name="Z-scored log2 CPM",
  col=col_fun, top_annotation=col_labels, show_row_names=F, show_column_names=F,
  column_split=tissue)

Heatmap(shared_effect, column_title="Shared effect", name="Z-scored log2 CPM",
  col=col_fun, top_annotation=col_labels, show_row_names=F, show_column_names=F,
  column_split=tissue)

These look mostly the same, and they don’t show any differences based on disease! That’s because the effect size, and in turn the color scale, is dominated by the tissue differences.

What we really want to see are the disease effects in each tissue separately. We can fix this by z-scoring separately for each tissue.

2.2.14 Make separate norm tables for each tissue

norm.drg <- norm[, tissue == "DRG"]
norm.sciatic <- norm[, tissue == "Sciatic_nerve"]

2.2.15 Z-score each table

norm.drg.z <- t(scale(t(norm.drg)))
norm.sciatic.z <- t(scale(t(norm.sciatic)))

2.2.16 Combine the two tables back together

norm.z <- cbind(norm.drg.z, norm.sciatic.z)

2.2.17 Reorder the columns to match the original sample order

norm.z <- norm.z[, colnames(norm)]

2.2.18 Filter for effects

Note: this is the same as above, so don’t be afraid to copy-paste

different_effect <- norm.z[disease.sel & inter.sel,]
shared_effect <- norm.z[disease.sel & !inter.sel,]

2.2.19 Create heatmaps

We will utilize the column_split parameter to make side-by-side heatmaps for the tissues.

Heatmap(different_effect, column_title="Different effect", name="Z-scored log2 CPM",
  col=col_fun, top_annotation=col_labels, show_row_names=F, show_column_names=F,
  column_split=tissue)

Heatmap(shared_effect, column_title="Shared effect", name="Z-scored log2 CPM",
  col=col_fun, top_annotation=col_labels, show_row_names=F, show_column_names=F,
  column_split=tissue)

The “different effect” heatmap shows a variety of patterns, such as genes that are down-regulated in peak disease in sciatic nerve, followed by an increase to normal in recovery; these same genes increase in the DRG in peak disease, and remain elevated in recovery.

In the “shared effect” heatmap, we see a few different patterns, but they are shared across tissues.

These are some examples of how we would use different filtering criteria and/or normalization strategies to address different questions.

2.3 Repeated measures differential

NOTE: The columns of the data and the rows of the metadata are not in the same order for this data set. We will use the same trick to reorder the data within R.

ABOUT: These data are from a 16S amplicon experiment where the same cohort of animals was subject to one of three treatments (A, B, or C) over a 5-day time course. Samples were collected before treatmeant (Pre), and after 2, 3, and 5 days. The counts table are taxonomic summaries at a genus level. We’ll run a repeated measures test controlling for animal-to-animal differences and testing for differences between time points and treatments.

2.3.1 Read in counts data

data <- read.delim("https://wd.cri.uic.edu/edgeR/16S_counts.txt", row.names=1)
dim(data)
## [1] 218  60
head(data, n=3)
##                                                                                         S001_1_107 S002_2_108
## Bacteria;Verrucomicrobiota;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia       5414       6918
## Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Acinetobacter          1538          0
## Bacteria;Firmicutes;Clostridia;Oscillospirales;Ruminococcaceae;Ruminococcus                      9123       5877

2.3.2 Read in metadata table

metadata <- read.delim("https://wd.cri.uic.edu/edgeR/16S_metadata.txt", row.names=1)
head(metadata, n=3)

Let’s check the distribution of animal IDs over time points to see which animals were subjected to different treatments. Note that some treatments are missing a couple animals.

table(metadata)
##         AnimalID
## Group    107 108 109 110 111 113 114 115 116 117 118 120 121 122 123
##   Day2.A   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0
##   Day2.B   0   0   0   0   0   1   1   1   1   1   1   0   0   0   0
##   Day2.C   0   0   0   0   0   0   0   0   0   0   0   1   1   1   1
##   Day3.A   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0
##   Day3.B   0   0   0   0   0   1   1   0   1   1   1   0   0   0   0
##   Day3.C   0   0   0   0   0   0   0   0   0   0   0   1   1   1   1
##   Day5.A   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0
##   Day5.B   0   0   0   0   0   0   1   0   1   1   1   0   0   0   0
##   Day5.C   0   0   0   0   0   0   0   0   0   0   0   1   1   0   0
##   Pre      1   1   1   1   1   1   1   1   1   1   1   1   1   1   1

Let’s make sure the metadata row order matches the data column order

nrow(metadata)
## [1] 55
ncol(data)
## [1] 60

NOTE: sample numbers are different: check what’s different.

colnames(data)[!colnames(data) %in% rownames(metadata)]
## [1] "S016_NC1"    "S028_27_119" "S043_42_119" "S048_NC2"    "S058_56_119"

The missing samples are: * 3 samples from animal 119 – did not have a Pre sample, excluded from metadata * 2 “NC” samples: negative controls – we want to exclude from analysis

2.3.3 Subset the data to match the metadata

data <- data[, rownames(metadata)]
dim(data)
## [1] 218  55

2.3.4 Subset counts to taxa with >500 total counts

We have 218 samples, so this is an average of 2.3 counts per sample. We are also most interested in the more abundant taxa, lower abundant taxa are more likely to be data processing artifacts.

data_subset <- data[rowSums(data) > 500,]
dim(data_subset)
## [1] 121  55

2.3.5 Define factors and model matrix

treat <- factor(metadata[, 1])
subject <- factor(metadata[, 2])
model <- model.matrix(~ 0 + treat + subject)

2.3.6 Create edgeR object and fit the model

taxa <- DGEList(counts=data_subset)
taxa <- calcNormFactors(taxa)
taxa <- estimateDisp(taxa, model)

Check the BCV.

sqrt(taxa$common.dispersion)
## [1] 0.7809027

Fit the model.

fit <- glmQLFit(taxa, model)

2.3.7 Run contrasts

First, let’s look at coefficient names in the model.

colnames(model)
##  [1] "treatDay2.A" "treatDay2.B" "treatDay2.C" "treatDay3.A" "treatDay3.B"
##  [6] "treatDay3.C" "treatDay5.A" "treatDay5.B" "treatDay5.C" "treatPre"   
## [11] "subject108"  "subject109"  "subject110"  "subject111"  "subject113" 
## [16] "subject114"  "subject115"  "subject116"  "subject117"  "subject118" 
## [21] "subject120"  "subject121"  "subject122"  "subject123"

Let’s run a contrast for Day5.A vs Pre treatment.

contrast <- makeContrasts(treatDay5.A - treatPre, levels=model)
qlf <- glmQLFTest(fit, contrast=contrast)
qlf$table$QValue <- p.adjust(qlf$table$PValue, method="BH")

2.3.8 Computed normalized values adjusted for repeated measures effects

Note that we will cover batch effect removal again in the next exercise. Here we are removing the batch effect based on the patient ID.

taxa.norm <- cpm(taxa, log=T)
taxa.norm <- removeBatchEffect(taxa.norm, subject)

2.3.9 Check the number of significant taxa.

taxa.signif <- rownames(qlf$table)[qlf$table$QValue < 0.05]
length(taxa.signif)
## [1] 13

2.3.10 Get the normalized values for significant taxa.

norm.signif <- taxa.norm[taxa.signif,]

2.3.11 Z-scale normalized values

norm.signif <- t(scale(t(norm.signif)))

2.3.12 Taxa names are too long for display, let’s fix that.

First remove “Other” taxa at the end (the $ anchors the search to the end of the string), then remove all higher-level taxa names.

rownames(norm.signif) <- gsub(";Other$", "", rownames(norm.signif))
rownames(norm.signif) <- gsub(".*;", "", rownames(norm.signif))

2.3.13 Create a color bar for the top of the heatmap based on treatment group names.

group.labels <- HeatmapAnnotation(Group=treat)

2.3.14 Create the heatmap

We will force the treatment groups to be together using the column_split option. We will also hide the column names and dendrogram in favor of our annotations using the column_title and show_column_dend option.

Heatmap(norm.signif, col=col_fun, name="Z-scored log2 CPM", column_split=treat,
        top_annotation=group.labels, column_title=NULL, show_column_dend=F)

2.3.15 Create a heatmap for group 5A vs pre

contrast.sel <- treat == "Day5.A" | treat == "Pre"
norm.signif.subset <- norm.signif[, contrast.sel]
subset.labels <- HeatmapAnnotation(Group=treat[contrast.sel])
Heatmap(norm.signif.subset, col=col_fun, name="Z-scored log2 CPM",
        column_split=treat[contrast.sel], top_annotation=subset.labels,
        column_title=NULL, show_column_dend=F)

Exercise to try at home: there are a LOT more contrasts we could run on these data. Can you figure out how to do Day3.A vs Pre? How about the average of Day2 vs Pre (how would you do an average of multiple groups in a contrast)?

2.4 Batch effect correction

ABOUT: These data are from an RNA-seq experiment comparing B cells from five mouse genotypes, with three replicates per group. However, the replicates were collected at different times, so there is a batch effect present, especially for replicate 1.

Reload the edgeR library in case you closed R studio

library(edgeR)

2.4.1 Read in counts data

data <- read.delim("https://wd.cri.uic.edu/edgeR/batch_counts.txt", row.names=1)
dim(data)
## [1] 23366    15
head(data, n=3)

2.4.2 Read in metadata table

metadata <- read.delim("https://wd.cri.uic.edu/edgeR/batch_metadata.txt", row.names=1)
head(metadata, n=3)

2.4.3 Make sure the metadata row order matches the data column order

Check the data again to make sure we didn’t drop any columns!

data <- data[, rownames(metadata)]
dim(data)
## [1] 23366    15

2.4.4 Subset counts to genes with >20 total counts

data_subset <- data[rowSums(data) > 20,]
dim(data_subset)
## [1] 14037    15

2.4.5 Define our factors

geno <- factor(metadata[, 1])
batch <- factor(metadata[, 2])

2.4.6 Create a model

First, let’s try the analysis without the batch correction. We will just model the genotype.

model <- model.matrix(~ geno)
head(model, n=3)
##   (Intercept) genoB genoC genoD genoE
## 1           1     0     0     0     0
## 2           1     0     0     0     0
## 3           1     0     0     0     0

2.4.7 Create the edgeR object, calculate TMM factors, and estimate dispersion

genes <- DGEList(counts=data_subset)
genes <- calcNormFactors(genes)
genes <- estimateDisp(genes, model)

2.4.8 Calculate BCV from the common dispersion

sqrt(genes$common.dispersion)
## [1] 0.3957431

2.4.9 Fit the model

fit <- glmQLFit(genes, model)

2.4.10 Run ANOVA: test all coefficients other than the intercept

Also do the B-H p-value adjustment.

qlf <- glmQLFTest(fit, coef=2:5)
qlf$table$QValue <- p.adjust(qlf$table$PValue, method="BH")

Check the number of significant genes.

sum(qlf$table$QValue < 0.05)
## [1] 1528

2.4.11 Generate log-scaled normalized expression

norm <- cpm(genes, log=T)

2.4.12 Run PCA

pca <- prcomp(t(norm))

2.4.13 Create labels for the X and Y axes

importance <- summary(pca)$importance[2,]
xlabel <- sprintf("PC1 (%.2f%%)", 100 * importance[1])
ylabel <- sprintf("PC2 (%.2f%%)", 100 * importance[2])

2.4.14 Setup the data for ggplot

We will merge the PCA results with the metadata.

pca.data <- cbind(pca$x, metadata, Sample=rownames(metadata))

2.4.15 Create the plot

We will set the shape to be based on the batch and the color based on the genotype. Note that we have to force Batch to be seen as a factor instead of a continuous variable by using the as.factor() function. We will need to adjust its label in the legend to make up for this.

ggplot(pca.data, aes(x=PC1, y=PC2)) +
  geom_point(aes(shape=as.factor(Batch), color=Genotype)) + 
  geom_text(aes(label=Sample), hjust=0, vjust=0, show.legend=F) +
  coord_cartesian(clip='off') + labs(x=xlabel, y=ylabel, shape="Batch")

Note that all replicate 1 samples are to the left, and other samples are to the right. This separation is along PC1, which is capturing the largest portion of the variance. So, there is a batch effect, and it’s pretty strong.

2.4.16 Create a new model with the batch effect correction

The new model matrix with batch factor and no interaction term

batch.model <- model.matrix(~ geno + batch)
head(batch.model, n=3)
##   (Intercept) genoB genoC genoD genoE batch2 batch3
## 1           1     0     0     0     0      0      0
## 2           1     0     0     0     0      1      0
## 3           1     0     0     0     0      0      1

2.4.17 Create edgeR object, calculate TMM factors, and estimate dispersion

batch.genes <- DGEList(counts=data_subset)
batch.genes <- calcNormFactors(batch.genes)
batch.genes <- estimateDisp(batch.genes, batch.model)

2.4.18 Calculate BCV from the common dispersion

This should be much smaller!

sqrt(batch.genes$common.dispersion)
## [1] 0.2762293

2.4.19 Fit the new model

batch.fit <- glmQLFit(batch.genes, batch.model)

2.4.20 Test again for the genotype effect

This test looks the same as before, the batch terms are just “along for the ride”. Also do the B-H p-value adjustment.

batch.qlf <- glmQLFTest(batch.fit, coef=2:5)
batch.qlf$table$QValue <- p.adjust(batch.qlf$table$PValue, method="BH")

Check the number of significant genes. There should be many more!

sum(batch.qlf$table$QValue < 0.05)
## [1] 2564

2.4.21 Correct for batch effect

We just need to give the batch factor vector. We can start with the same normalized expression as above

batch.norm <- removeBatchEffect(norm, batch)

2.4.22 Run PCA

batch.pca <- prcomp(t(batch.norm))

2.4.23 Create labels for the X and Y axes

batch.importance <- summary(batch.pca)$importance[2,]
batch.xlabel <- sprintf("PC1 (%.2f%%)", 100 * batch.importance[1])
batch.ylabel <- sprintf("PC2 (%.2f%%)", 100 * batch.importance[2])

2.4.24 Setup the data for ggplot

batch.pca.data <- cbind(batch.pca$x, metadata, Sample=rownames(metadata))

2.4.25 Create the plot

ggplot(batch.pca.data, aes(x=PC1, y=PC2)) +
  geom_point(aes(shape=as.factor(Batch), color=Genotype)) +
  geom_text(aes(label=Sample), hjust=0, vjust=0, show.legend=F) +
  coord_cartesian(clip='off') +
  labs(x=batch.xlabel, y=batch.ylabel, shape="Batch")

This looks better, the separation based on batch is gone.

2.5 Analysis with a continuous variable

ABOUT: This is a clinical data set (46 samples), where some patients had a disease and others did not. The patients were also scored according to a separate phenotype. We are looking for genes that are different according to the disease, or the phenotype, or based on a combination (interaction) between these.

2.5.1 Read in counts table

data <- read.delim("https://wd.cri.uic.edu/edgeR/cont_counts.txt", row.names=1)
dim(data)
## [1] 57230    46

2.5.2 Read in metadata table

metadata <- read.delim("https://wd.cri.uic.edu/edgeR/cont_metadata.txt", row.names=1)
head(metadata, n=3)

2.5.3 Make sure the metadata row order matches the data column order

Check the data again to make sure we didn’t drop any columns!

data <- data[, rownames(metadata)]
dim(data)
## [1] 57230    46

2.5.4 Subset counts to genes with >20 total counts

data_subset <- data[rowSums(data) > 20,]
dim(data_subset)
## [1] 19827    46

2.5.5 Create a histogram of the Phenotype Score

We will split the scores into blocks of 20 for plotting purposes using the seq() function.

breaks <- seq(0, max(metadata$Phenotype_Score) + 20, 20)
ggplot(metadata, aes(x=Phenotype_Score)) +
       geom_histogram(breaks=breaks, fill="white", col="black")

2.5.6 Check a boxplot showing any association between score and disease

ggplot(metadata, aes(x=Disease, y=Phenotype_Score)) +
       geom_boxplot()

2.5.7 Run the Kurskal-Wallis rank sum test

kruskal.test(Phenotype_Score ~ Disease, data=metadata)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Phenotype_Score by Disease
## Kruskal-Wallis chi-squared = 0.27836, df = 1, p-value = 0.5978

This looks OK: if there was a strong association (small p-value), then the variables would be confounded and we couldn’t test them independently. So we can continue:

2.5.8 Define the variable/factor and model matrix

We will use the as.numeric() function for the score to ensure it is seen as a number instead of a factor.

score <- as.numeric(metadata[, 1])
disease <- factor(metadata[, 2])
model <- model.matrix(~ score + disease + score:disease)
head(model, n=3)
##   (Intercept) score diseaseTRUE score:diseaseTRUE
## 1           1    82           0                 0
## 2           1    90           0                 0
## 3           1    65           1                65

2.5.9 Create the edgeR object, calculate TMM factors, and estimate dispersion

This may take a minute - there are a lot of samples!

genes <- DGEList(counts=data_subset)
genes <- calcNormFactors(genes)
genes <- estimateDisp(genes, model)

2.5.10 Calculate BCV from the common dispersion

Note the BCV is much higher in this data set. Clinical data sets are often highly variable!

sqrt(genes$common.dispersion)
## [1] 2.027726

2.5.11 Fit the model

fit <- glmQLFit(genes, model)

Let’s take a look at the coefficients to get a sense of directionality for each gene.

head(fit$coefficients, n=3)
##                 (Intercept)        score diseaseTRUE score:diseaseTRUE
## ENSG00000000419  -10.931903  0.006592038   0.8829765      -0.013948637
## ENSG00000000457  -13.949042  0.010968106   3.7684653      -0.076915346
## ENSG00000000460   -9.714775 -0.049499178  -1.9882175       0.006747026

2.5.12 Now we test each variable/factor in turn

The continuous variable is always just 1 term.

qlf.score <- glmQLFTest(fit, coef=2)
qlf.disease <- glmQLFTest(fit, coef=3)
qlf.inter <- glmQLFTest(fit, coef=4)

2.5.13 Do B-H p-value adjustment for each

qlf.score$table$QValue <- p.adjust(qlf.score$table$PValue, method="BH")
qlf.disease$table$QValue <- p.adjust(qlf.disease$table$PValue, method="BH")
qlf.inter$table$QValue <- p.adjust(qlf.inter$table$PValue, method="BH")

Check the number of significant genes for each

sum(qlf.score$table$QValue < 0.05)
## [1] 1964
sum(qlf.disease$table$QValue < 0.05)
## [1] 1114
sum(qlf.inter$table$QValue < 0.05)
## [1] 847

3 Extra (Take Home) Exercises

3.1 Run all pairwise comparisons

This uses data associated with the one-way ANOVA (Exercises 1.5 and 1.6). But we will run pairwise comparisons between all pairs of the factor levels.

  • Loop over all pairs of levels and keep updating our results
  • With just 3 groups, we could write these out manually, but it’s better to use a loop, since we can extend the same code to a bigger design.
# read in, reconicle, and filter data as usual
data <- read.delim("https://wd.cri.uic.edu/edgeR/anova_counts.txt", row.names=1)
metadata <- read.delim("https://wd.cri.uic.edu/edgeR/anova_metadata.txt", row.names=1)
data <- data[, rownames(metadata)]
data_subset <- data[rowSums(data) > 20,]
dim(data_subset)
## [1] 16713    12
treat <- factor(metadata[, 1])
# start a new data frame to store the results in
pw_stats <- data.frame(Gene = rownames(data_subset))
# get a list of the levels, and the number
levs <- levels(treat)
nlevs <- length(levs)
# loop over all pairs of levels
for(i in 1:(nlevs-1)){
    for(j in (i + 1):nlevs){
        # names for these groups
        A <- levs[i]
        B <- levs[j]
        # get the logical vectors for these groups
        groupA <- A == treat
        groupB <- B == treat
        # subset the data for groupA or groupB data
        pw_metadata <- treat[as.logical(groupA + groupB)]
        pw_counts <- data_subset[, as.logical(groupA + groupB)]
        # run our edgeR commands
        pw_genes <- DGEList(counts=pw_counts, group=pw_metadata)
        pw_genes <- calcNormFactors(pw_genes)
        pw_genes <- estimateDisp(pw_genes)
        pw_test <- exactTest(pw_genes)
        pw_test$table$QValue <- p.adjust(pw_test$table$PValue, method="BH")
        # add the name of this comparison to the column names
        # edgeR does the comparisons as groupB / groupA, so
        # we list groupB first
        pw_name <- paste(B, A, sep="/")
        colnames(pw_test$table) = paste(pw_name, ":", colnames(pw_test$table))
        # add results to the pairwise list
        pw_stats <- cbind(pw_stats, pw_test$table)
    }
}
## Using classic mode.
## Using classic mode.
## Using classic mode.
head(pw_stats, n=3)

NOTE: you may find it useful to remake the above code as a function, which will allow you to rerun it again more easily on a new data set. For example:

pw_tests <- function(counts, groups){
  # start a new data frame to store the results in
  pw_stats <- data.frame(Gene = rownames(counts))
  # get a list of the levels, and the number
  levs <- levels(groups)
  nlevs <- length(levs)
  # loop over all pairs of levels
  for (i in 1:(nlevs - 1)) {
    for (j in (i + 1):nlevs) {
      # names for these groups
      A <- levs[i]
      B <- levs[j]
      # get the logical vectors for these groups
      groupA <- A == groups
      groupB <- B == groups
      # subset the data for groupA or groupB data
      pw_metadata <- treat[as.logical(groupA + groupB)]
      pw_counts <- data_subset[, as.logical(groupA + groupB)]
      # run our edgeR commands
      pw_genes <- DGEList(counts = pw_counts, group = pw_metadata)
      pw_genes <- calcNormFactors(pw_genes)
      pw_genes <- estimateDisp(pw_genes)
      pw_test <- exactTest(pw_genes)
      pw_test$table$QValue <- p.adjust(pw_test$table$PValue, method = "BH")
      # add the name of this comparison to the column names
      # edgeR does the comparisons as groupB / groupA, so
      # we list groupB first
      pw_name <- paste(B, A, sep = "/")
      colnames(pw_test$table) = paste(pw_name, ":", colnames(pw_test$table))
      # add results to the pairwise list
      pw_stats <- cbind(pw_stats, pw_test$table)
    }
  }
  return(pw_stats)
}

To run the function:

all_pairwise <- pw_tests(data_subset, treat)
head(all_pairwise, n=3)

3.2 Analysis with no replicates

PLEASE AVOID NEEDING TO DO THIS: you should always have biological replicates. But if you absolutely must analyze the data without replicates, at least do it by setting a fixed dispersion and computing approximate p-values, rather than simply looking at the fold-change. Using a fold-change threshold will bias you towards low-expressed genes. Note that the significance level of the q-values from this analysis are very dependent on the specific BCV you choose. However, it is fair to use the q-value to prioritize the genes based on which appear to be most strongly affected.

3.2.1 Read in counts table

data <- read.delim("https://wd.cri.uic.edu/edgeR/norep_counts.txt", row.names=1)
dim(data)
## [1] 24421     2

3.2.2 Read in metadata table

metadata <- read.delim("https://wd.cri.uic.edu/edgeR/norep_metadata.txt", row.names=1)
metadata

3.2.3 Make sure the metadata row order matches the data column order

Check the data again to make sure we didn’t drop any columns!

data <- data[, rownames(metadata)]
dim(data)
## [1] 24421     2

3.2.4 Subset counts to genes with >20 total counts

data_subset <- data[rowSums(data) > 20,]
dim(data_subset)
## [1] 14947     2

3.2.5 Create the edgeR object and calculate TMM factors

genes <- DGEList(counts=data_subset, group=metadata[, 1])
genes <- calcNormFactors(genes)

3.2.6 Estimate dispersion

Since we don’t have the replicates samples to estimate dispersion, we will just pick a reasonable value. Let’s assume a BCV of 0.2 (20%).

bcv <- 0.2

3.2.7 Run differential analysis, specifying the dispersion to use

Here we will use the dispersion option using the square of the estimated BCV. Also run the B-H FDR correction.

stats <- exactTest(genes, dispersion=bcv^2)
stats$table$QValue <- p.adjust(stats$table$PValue, method="BH")

3.2.8 Sort by q-value to prioritize top genes

head(stats$table[order(stats$table$QValue),], n=3)

3.3 Gene ID conversion with biomaRt

Use the biomaRt package from Bioconductor. If this is the first time using biomaRt, we will need to install it with BiocManager.

if (!require("BiocManager", quietly=T))
    install.packages("BiocManager")
BiocManager::install("biomaRt")
library(biomaRt)

3.3.1 Read in the data table

data <- read.delim("https://wd.cri.uic.edu/edgeR/mouse_ensembl_rnaseq.txt", row.names=1)
head(data, n=3)

Note that the identifiers are:

  • Ensembl IDs (start with ENS).
  • For mouse (MUS). If no three-letter code is shown, then they’re human IDs.
  • For genes, rather than isoforms/transcripts or proteins, etc. (G).
  • Have version numbers (end in .[number]). Will need to remove versions before looking up symbols.

3.3.2 Get rid of version numbers in transcript names

This is is also necessary if you’re going to upload to a pathway database. This is an example of a REGULAR EXPRESSION (REGEX). We have seen a little of also in the grep and sed commands in linux. Very powerful!

Note that \\. matches the period and [0-9]* matches 1 or more numbers.

head(rownames(data), n=3)
## [1] "ENSMUSG00000051951.5" "ENSMUSG00000102851.1" "ENSMUSG00000103147.1"
newnames <- gsub("\\.[0-9]*", "", rownames(data))
head(newnames, n=3)
## [1] "ENSMUSG00000051951" "ENSMUSG00000102851" "ENSMUSG00000103147"

Now convert the IDs to gene symbols. Check the manual for the biomaRt package for more details. In particular:

  • The searchDatasets() function lets you search for available databases (for use in useDataset()).
  • The listAttributes() function lets you see what fields you can map your IDs to (for use in getBM()).

For today, we already know the database (mmusculus_gene_ensembl) and attributes we want (ensembl_gene_id and gene symbol AKA external_gene_name). Note that both of these commands may take some time to run, since they are fetching data from an online database hosted by Ensembl.

3.3.3 Get the database of mouse gene annotations

mart <- useDataset("mmusculus_gene_ensembl", useMart("ensembl"))
gene_list <- getBM(filters="ensembl_gene_id", 
   attributes=c("ensembl_gene_id", "external_gene_name"),
   values=newnames, mart=mart)
gene_list = read.delim("gene_list.txt")
head(gene_list, n=3)

3.3.4 Note that not ALL IDs may get matched

length(newnames)
## [1] 39056
nrow(gene_list)
## [1] 38535

3.3.5 Also note that gene symbols are NOT unique relative to the gene IDs

length(unique(gene_list[, 1]))
## [1] 38535
length(unique(gene_list[, 2]))
## [1] 38492

Now that you have the list, you can combine with an existing data frame. In the example below, we combine with the counts table using merge().

NOTE: usually you would want to keep the original IDs as the main identifiers throughout your analysis in edgeR, and then merge the symbols or other information with your normalized expression (CPM) or differential stats tables AFTER running the analysis. You can’t have a mix of symbols and other strings in your counts table when processing through edgeR.

3.3.6 Replace the row names with the IDs after stripping off the version numbers

rownames(data) = newnames

3.3.7 Merge based on ensembl_gene_id and rownames

Keep all entries by using the all option. Unmatched IDs will have NA values for the gene symbol.

data.named <- merge(data, gene_list, by.x="row.names", by.y="ensembl_gene_id", all=T)
head(data.named, n=3)

3.4 Two-way ANOVA: interaction but no independent effect

(Start from the end of Exercise 2.2 to lead into this exercise.)

There’s another way we could consider filtering for genes with a significant interaction: just filter directly on the interaction term, ignoring the independent effects of disease. That is:

any_inter <- norm.z[inter.sel,]

Or, we could separate just the significant interactions from those that also have independent disease effects:

inter_only <- norm.z[inter.sel & !disease.sel,]

Try a heatmap with the second option:

Heatmap(inter_only, col=col_fun, name="Z-scored log2 CPM",
        top_annotation=col_labels, show_row_names=F, show_column_names=F,
        column_split = tissue)

It looks like we have a clear effect in sciatic nerve, but the picture in DRG is murkier. Assuming you’re looking at norm.z with values z-scored separately for each tissue, it’s a little hard to assess the relative effect in DRG because we’ve z-scored.

We will try running one-way ANOVA tests separately in each tissue.

samples.drg <- tissue=="DRG"
samples.sn <- tissue=="Sciatic_nerve"
data.drg <- data_subset[, samples.drg]
data.sn <- data_subset[, samples.sn]
disease.drg <- disease[samples.drg]
disease.sn <- disease[samples.sn]
model.drg <- model.matrix(~disease.drg)
model.sn <- model.matrix(~disease.sn)
genes.drg <- DGEList(data.drg)
genes.sn <- DGEList(data.sn)
genes.drg <- calcNormFactors(genes.drg)
genes.sn <- calcNormFactors(genes.sn)
genes.drg <- estimateDisp(genes.drg, model.drg)
genes.sn <- estimateDisp(genes.sn, model.sn)
fit.drg <- glmQLFit(genes.drg, model.drg)
fit.sn <- glmQLFit(genes.sn, model.sn)
test.drg <- glmQLFTest(fit.drg, coef=2:3)
test.sn <- glmQLFTest(fit.sn, coef=2:3)
test.drg$table$QValue <- p.adjust(test.drg$table$PValue, method="BH")
test.sn$table$QValue <- p.adjust(test.sn$table$PValue, method="BH")
# count the number of DEGs
sum(test.drg$table$QValue < 0.05)
## [1] 921
sum(test.sn$table$QValue < 0.05)
## [1] 3684
# which of these genes were detected before in the "disease" factor?
sum(test.drg$table$QValue < 0.05 & disease.sel)
## [1] 548
sum(test.sn$table$QValue < 0.05 & disease.sel)
## [1] 365

It does look like the general disease effect is stronger in sciatic nerve. Remake the heatmap, with a row annotation indicating the significance for each tissue separately.

# prepare annotations for the heatmap based on -log10 FDR
signif.drg <- -log10(test.drg$table$QValue)
signif.sn <- -log10(test.sn$table$QValue)
labels.drg <- signif.drg[!disease.sel & inter.sel]
labels.sn <- signif.sn[!disease.sel & inter.sel]

# set an alternate color scale for significance levels
col_signif <- colorRamp2(c(0, -log10(0.05), 5), c("white", "red", "yellow"))
# use the same color scale for both annotations
anova.label <- rowAnnotation(DRG=labels.drg, SN=labels.sn,
        col=list(DRG=col_signif, SN=col_signif))
Heatmap(inter_only, col=col_fun, name="Z-scored log2 CPM",
        top_annotation=col_labels, show_row_names=F, show_column_names=F,
        column_split = tissue, left_annotation = anova.label)

All of these show a significant effect in sciatic nerve, whereas only some of them show it in DRG.

3.5 Advanced PCA plots

For this exercise, we’ll use a shotgun metagenomics data set, with taxonomic quantification summarized at a phylum level. These are data obtained from a mouse testing a surgery injury model, and a treatment for the injury. We have two experimental factors: (1) Injury, with levels Naive, Sham, and Surgery; and (2) Treatment, with levels treated and control.

3.5.1 Load the edgeR and ggplot2 libraries

library(edgeR)
library(ggplot2)

3.5.2 Read in the counts table

Observe that the features are taxonomic names, down to phylum level. All are for bacteria.

data <- read.delim("https://wd.cri.uic.edu/edgeR/metagenomics_counts.txt", row.names=1)
head(rownames(data), n=3)
## [1] "sk_Bacteria;k__Bacteria incertae sedis;p__Caldiserica"                 
## [2] "sk_Bacteria;k__Bacteria incertae sedis;p__Candidatus Levybacteria"     
## [3] "sk_Bacteria;k__Bacteria incertae sedis;p__Candidatus Magasanikbacteria"

3.5.3 Read in the metadata table

metadata <- read.delim("https://wd.cri.uic.edu/edgeR/metagenomics_metadata.txt",
  row.names=1)

3.5.4 Simplify taxonomic names to just the phylum

rownames(data) <- gsub("^.*;p__", "", rownames(data))
head(rownames(data), n=3)
## [1] "Caldiserica"                  "Candidatus Levybacteria"     
## [3] "Candidatus Magasanikbacteria"

3.5.5 Match metadata and data

data <- data[, rownames(metadata)]

3.5.6 Filter out low abundance taxa

data_subset <- data[rowSums(data) > 20,]

3.5.7 Run just a few steps in edgeR to get normalized abundance

Yyou can run differential stats also, but you should be an expert on that by now so we omit those steps here.

taxa <- DGEList(counts = data_subset)
taxa <- calcNormFactors(taxa)
taxa.norm <- cpm(taxa, log=T)

3.5.8 Run PCA analysis and make a data frame with metadata

pca <- prcomp(t(taxa.norm))
pca.data <- cbind(pca$x, metadata)

3.5.9 Create plots colored by Injury and Treatment

ggplot(pca.data, aes(x=PC1, y=PC2, color=Injury)) + geom_point()

ggplot(pca.data, aes(x=PC1, y=PC2, color=Treatment)) + geom_point()

3.5.10 Facet by Treatment

ggplot(pca.data, aes(x=PC1, y=PC2, color=Treatment)) + geom_point() +
   facet_grid(~Injury)

3.5.11 Plot each factor, add confidence ellipse

ggplot(pca.data, aes(x=PC1, y=PC2, color=Treatment)) + geom_point() +
  facet_grid(~Injury) + stat_ellipse()

3.5.12 Add counts for Bacteroidetes and Firmicutes

The Firmicutes/Bacteroidetes ratio is a common measure of the health of a gut microbiome. We can add that value to the PCA plot as the dot colors.

pca.data.expanded <- cbind(pca.data,
   Bacteroidetes = t(data["Bacteroidetes",]),
   Firmicutes = t(data["Firmicutes",]))

3.5.13 Plot colored by Firmicutes/Bacteriodetes ratio

ggplot(pca.data.expanded, aes(x=PC1, y=PC2, color=Firmicutes / Bacteroidetes)) +
   geom_point()

3.5.14 Plot colored by the log2-ratio

ggplot(pca.data.expanded, aes(x=PC1, y=PC2, color=log2(Firmicutes / Bacteroidetes))) +
  geom_point()

3.5.15 Make a biplot

This allows us to visualize the contribution of each feature on the principle components. Note that you may want to be careful with this plot if you have a huge number of features, but with a relatively small number of taxa it will work.

First, get the rotation matrix as a data frame.

rotation <- data.frame(taxa=rownames(pca$rotation), pca$rotation)

Next determine a scalar to multiply the vector lengths by so that the scales line correctly.

mult.pc1 <- max(pca.data[, "PC1"]) - min(pca.data[, "PC1"]) /
           (max(rotation[, "PC1"]) - min(rotation[, "PC1"]))
mult.pc2 <- max(pca.data[, "PC2"]) - min(pca.data[, "PC2"]) /
           (max(rotation[, "PC2"]) - min(rotation[, "PC2"]))
mult <- min(mult.pc1, mult.pc2)
rotation$PC1 = mult * rotation$PC1
rotation$PC2 = mult * rotation$PC2

Start creating the plot of the PCA data colored by treatment. We will separate this into multiple lines to demonstrate how you can add plot components together.

plot <- ggplot(pca.data, aes(x=PC1, y=PC2, color=Treatment)) + geom_point()

Add vertical and horizontal lines for the middle of the biplot using geom_hline() and geom_vline() methods.

plot <- plot + geom_hline(yintercept=0) + geom_vline(xintercept=0)

Add text labels for each taxa. Set clip=“off” to let text labels go outside the plot area.

plot <- plot + geom_text(data=rotation, aes(x=PC1, y=PC2, label=taxa),
       size=3, hjust=0, vjust=0, color="red") + coord_equal(clip='off')

Add arrows for the biplot: from the origin (0,0) to the PC coordinate in the rotation matrix.

plot <- plot + geom_segment(data=rotation, aes(x=0, y=0, xend=PC1, yend=PC2),
   arrow=arrow(length=unit(0.2, "cm")), color="red")

Finally, print out the plot.

plot